In this workshop, we will do several steps to visualize metagenomics data using R. we will start with data generated in previous session, followed by abundance quantification and statistical analysis.

Content

Tips:

1. what kinds of plots we will find in metagenomic paper?

a. Bar plot illustrating the clinical covariates that are associated with variation in the neonatal gut microbiota on day 4 (n = 310 individuals), day 7 (n = 532 individuals), day 21 (n = 325 individuals) and in infancy (n = 302 individuals).b, Longitudinal changes in the mean relative abundance of genera of faecal bacteria, sampled on day 4, day 7, day 21 and in infancy, for genera with >1% mean relative abundance across all samples from the neonatal period.

Analysis of host attachment and growth rates of CPR/DPANN organisms.)

Heat map representing differentially abundant bacterial species (false discovery rate < 0.01) between HSPC and CRPC.

2. Visualization - work on your own data

library(phyloseq)
library(ggplot2)
library(tidyverse)
library(ggpubr)
library(ggsci)
library(stringr)
library(dutchmasters)
2.1 data structure of phyloseq object
load('phyloseq_object_data_2.Rdata')

  • SampleData: include sample information
  • OTUdata: include abundance of sample
  • TAXAData: include microbiome information
phylo_obj_2@sam_data[1:3,1:3]
##                SampleID  Group    Condition
## SRR16348844 SRR16348844   BP-2  Blautia-HFD
## SRR16348845 SRR16348845   BP-1  Blautia-HFD
## SRR16348846 SRR16348846 Norm-8 Control-Chow
phylo_obj_2@otu_table[1:3,1:3]
## OTU Table:          [3 taxa and 3 samples]
##                      taxa are rows
##       SRR16348844 SRR16348845 SRR16348846
## OTU 1      511626        1230     1782021
## OTU 2       69651       30157           0
## OTU 3       80033      150153           0
phylo_obj_2@tax_table[1:3,1:3]
## Taxonomy Table:     [3 taxa by 3 taxonomic ranks]:
##       Kingdom       Phylum               Class                     
## OTU 1 "k__Bacteria" "p__Verrucomicrobia" "c__Verrucomicrobiae"     
## OTU 2 "k__Bacteria" "p__Proteobacteria"  "c__Epsilonproteobacteria"
## OTU 3 "k__Bacteria" "p__Firmicutes"      "c__Erysipelotrichia"
2.2 convert the three tables to a combined data frame for plotting using function psmelt()
pd <- psmelt(phylo_obj_2)
pd[1:7,1:7]
##         OTU      Sample Abundance    SampleID  Group    Condition     Kingdom
## 2     OTU 1 SRR16348866   2917737 SRR16348866 Norm-2 Control-Chow k__Bacteria
## 18    OTU 1 SRR16348846   1782021 SRR16348846 Norm-8 Control-Chow k__Bacteria
## 3     OTU 1 SRR16348855   1577778 SRR16348855 Norm-3 Control-Chow k__Bacteria
## 1     OTU 1 SRR16348844    511626 SRR16348844   BP-2  Blautia-HFD k__Bacteria
## 21    OTU 1 SRR16348849    492468 SRR16348849 Norm-5 Control-Chow k__Bacteria
## 35   OTU 10 SRR16348857    401402 SRR16348857  HFD-3  Control-HFD k__Bacteria
## 1129 OTU 58 SRR16348852    392905 SRR16348852  HFD-7  Control-HFD k__Bacteria
  • 1890 obs. of 13 variables
pd = pd %>% mutate(Kingdom = str_replace(Kingdom, 'k__',''),
                   Phylum = str_replace(Phylum, 'p__',''),
                   Class = str_replace(Class, 'c__',''),
                   Order = str_replace(Order, 'o__',''),
                   Family = str_replace(Family, 'f__',''),
                   Genus = str_replace(Genus, 'g__',''),
                   Species = str_replace(Species, 's__',''))

2.3 which columns and which function will be used for bar plot?

colnames(pd)
##  [1] "OTU"       "Sample"    "Abundance" "SampleID"  "Group"     "Condition"
##  [7] "Kingdom"   "Phylum"    "Class"     "Order"     "Family"    "Genus"    
## [13] "Species"

In R, we use ggplot2, learn this package - Tutorial - For bar plot, we will use geom_bar(), use ‘?geom_bar’ in Rstudio to check the usage of this function - geom_bar() makes the height of the bar proportional to the number of cases in each group

# create a basic bar plot
# The abundance of species in each sample
ggplot(data = pd, aes(x = SampleID, y = Abundance)) + 
  geom_bar(stat = 'identity')

Q:the label of sampleID was stacked, how to solve it?

  • use theme() function to modify it.
ggplot(data = pd, aes(x = SampleID, y = Abundance)) + 
  theme_bw() +
  geom_bar(stat = 'identity') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Q: how to add group information

  • use fill() within geom_bar() to add it.
ggplot(data = pd, aes(x = SampleID, y = Abundance)) + 
  geom_bar(stat = 'identity', aes(fill = Condition)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Q: how to change width of bar

  • add width argument within geom_bar().
ggplot(data = pd, aes(x = SampleID, y = Abundance)) + 
  geom_bar(stat = 'identity', aes(fill = Condition),width = 0.2) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Q: how to custom the color of bar

  • use scale_fill_manual() function.
  • A color can be specified either by name (e.g.: “red”) or by hexadecimal code (e.g. : “#FF1234”)
  • please keep in mind that the number of colors you assign on your data should be same as the number of groups, in this case, it is 3.
  • color source
ggplot(data = pd, aes(x = SampleID, y = Abundance)) + 
  geom_bar(stat = 'identity', aes(fill = Condition),width = 0.8) + 
  #scale_fill_manual(values = c("blue","red","green")) +
  scale_fill_manual(values = c("#b3e2cd","#fdcdac","#cbd5e8")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Q: The distribution of class in those groups

  • change the formula to (Class ~ Abundance)
ggplot(data = pd, aes(x = Class, y = Abundance)) + 
  geom_bar(stat = 'identity', aes(fill = Condition),width = 0.8) + 
  #scale_fill_manual(values = c("blue","red","green")) +
  scale_fill_manual(values = c("#b3e2cd","#fdcdac","#cbd5e8")) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Q: how to change the colors for mutiple classes

  • can use defined color palette by the scientific journals when the number of colors you need less than 10 (category)
  • for example use scale_fill_npg() or scale_fill_aaas() or scale_fill_lancet() or more in ggsci package
#length(unique(pd$Phylum))
#6
ggplot(data = pd, aes(x = Condition, y = Abundance)) + 
  geom_bar(stat = 'identity', position = "fill", aes(fill = Phylum),width = 0.8) + 
  scale_fill_jco()

#length(unique(pd$Class))
#13
ggplot(data = pd, aes(x = Condition, y = Abundance)) + 
  geom_bar(stat = 'identity', position = "fill", aes(fill = Class),width = 0.8) + 
  scale_fill_dutchmasters(palette = "milkmaid")

Q: how to show relative abundance of class in each group

  • add position = “fill” with geom_bar()
ggplot(data = pd, aes(x = Class, y = Abundance)) + 
  geom_bar(stat = 'identity', position = "fill", aes(fill = Condition),width = 0.8) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_jco()

Q: how to change the title of y-axis

  • add function labs() to define the title of x- or y-axis
ggplot(data = pd, aes(x = Class, y = Abundance)) + 
  geom_bar(stat = 'identity', position = "fill", aes(fill = Condition),width = 0.8) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_jco() +
  labs(x = "Group", y = "Relative abundance (%)")

Q: how to change the backgroud

ggplot(data = pd, aes(x = Class, y = Abundance)) + 
  geom_bar(stat = 'identity', position = "fill", aes(fill = Condition),width = 0.8) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_jco() +
  labs(x = "Group", y = "Relative abundance (%)")

Q: how to change the position of legend.

  • add legend.position() within theme
ggplot(data = pd, aes(x = Condition, y = Abundance)) + 
  geom_bar(stat = 'identity', position = "fill", aes(fill =Phylum ),width = 0.8) + 
  theme_classic() +
  theme(legend.position = 'top') +
  scale_fill_jco() +
  labs(x = "Group", y = "Relative abundance (%)")

Q: how to save your figure

  • use ggsave()
fig <- ggplot(data = pd, aes(x = Condition, y = Abundance)) + 
  geom_bar(stat = 'identity', position = "stack", aes(fill =Phylum ),width = 0.8) + 
  theme_classic() +
  theme(legend.position = 'top') +
  scale_fill_jco() +
  labs(x = "Group", y = "Relative abundance (%)")
ggsave(fig, filename = 'Con2Abundance.pdf', width = 5, height = 4)

2.4 which columns and which function will be used for box plot?

  • use geom_boxplot()
load('Data2_Abundance_metadata.Rdata')
head(Metadata)
##      SampleID  Group    Condition Observed Chao1 se.chao1 ACE se.ACE    Shannon
## 1 SRR16348844   BP-2  Blautia-HFD       36    36        0 NaN    NaN 1.90764074
## 2 SRR16348845   BP-1  Blautia-HFD       30    30        0 NaN    NaN 2.26149321
## 3 SRR16348846 Norm-8 Control-Chow        9     9        0 NaN    NaN 0.08668567
## 4 SRR16348849 Norm-5 Control-Chow       20    20        0 NaN    NaN 1.96953105
## 5 SRR16348850 Norm-4 Control-Chow       21    21        0 NaN    NaN 1.86096898
## 6 SRR16348851  HFD-8  Control-HFD       21    21        0 NaN    NaN 2.05877370
##      Simpson InvSimpson
## 1 0.69315682   3.258994
## 2 0.84753991   6.559094
## 3 0.02409417   1.024689
## 4 0.77738789   4.492119
## 5 0.79825707   4.956803
## 6 0.80557316   5.143323
ggplot(data = Metadata, aes(x = Condition, y = Shannon)) + 
  geom_boxplot(aes(fill = Condition))

Q: how to evalute the difference of Shannon between groups, is it significant?

  • use stat_compare_means() in ggpubr package for statistical analysis
my_comparisons = list(c("Blautia-HFD","Control-HFD"),c("Control-HFD","Control-Chow",""))
ggplot(data = Metadata, aes(x = Condition, y = Shannon)) + 
  geom_boxplot(aes(fill = Condition)) +
  scale_fill_jco() +
  stat_compare_means(comparisons = my_comparisons, method = "wilcox.test",aes(as_label(..p.format..))) # ..p.signif..

  • please be careful when you use the method for statistics analysis (Student T-test or wilcox.test)

Learn more about boxplot in ggplot2

2.5 heatmap using phyloseq package

  • use function plot_heatmap
plot_heatmap(phylo_obj_2, method = "NMDS", distance = "bray")
## Warning: Transformation introduced infinite values in discrete y-axis

Q: how to show the most abundant OTUs

total = median(sample_sums(phylo_obj_2))
abundant_OTU <- filter_taxa(phylo_obj_2, function(x) sum(x > total*0.20) > 0, TRUE)
abundant_OTU
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 12 taxa and 21 samples ]
## sample_data() Sample Data:       [ 21 samples by 3 sample variables ]
## tax_table()   Taxonomy Table:    [ 12 taxa by 7 taxonomic ranks ]
plot_heatmap(abundant_OTU, method = "NMDS", distance = "bray")
## Warning: Transformation introduced infinite values in discrete y-axis

- you can also convert ‘abundant_OTU’ object to a read friendly table.

pd_abundant  = psmelt(abundant_OTU)
head(pd_abundant)
##       OTU      Sample Abundance    SampleID  Group    Condition     Kingdom
## 15  OTU 1 SRR16348866   2917737 SRR16348866 Norm-2 Control-Chow k__Bacteria
## 12  OTU 1 SRR16348846   1782021 SRR16348846 Norm-8 Control-Chow k__Bacteria
## 3   OTU 1 SRR16348855   1577778 SRR16348855 Norm-3 Control-Chow k__Bacteria
## 1   OTU 1 SRR16348844    511626 SRR16348844   BP-2  Blautia-HFD k__Bacteria
## 11  OTU 1 SRR16348849    492468 SRR16348849 Norm-5 Control-Chow k__Bacteria
## 34 OTU 10 SRR16348857    401402 SRR16348857  HFD-3  Control-HFD k__Bacteria
##                Phylum               Class                 Order
## 15 p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales
## 12 p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales
## 3  p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales
## 1  p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales
## 11 p__Verrucomicrobia c__Verrucomicrobiae o__Verrucomicrobiales
## 34      p__Firmicutes          c__Bacilli    o__Lactobacillales
##                 Family            Genus                    Species
## 15  f__Akkermansiaceae   g__Akkermansia s__Akkermansia_muciniphila
## 12  f__Akkermansiaceae   g__Akkermansia s__Akkermansia_muciniphila
## 3   f__Akkermansiaceae   g__Akkermansia s__Akkermansia_muciniphila
## 1   f__Akkermansiaceae   g__Akkermansia s__Akkermansia_muciniphila
## 11  f__Akkermansiaceae   g__Akkermansia s__Akkermansia_muciniphila
## 34 f__Lactobacillaceae g__Lactobacillus s__Lactobacillus_johnsonii
  • you can also change the name of y-axis and define the color
plot_heatmap(abundant_OTU, method = "MDS", distance = "(A+B-2*J)/(A+B-J)", 
               taxa.label = "Order", taxa.order = "Order", 
               trans=NULL, low="beige", high="red", na.value="beige")

2.6 heatmap in ggplot2

  • use geom_tile() in ggplot2
ggplot(data = pd_abundant, aes(x=SampleID, y=Class, fill = Abundance)) +
  geom_tile() +
  scale_fill_continuous(low = 'beige', high = 'red') + 
  theme(axis.text.x = element_text(angle = 270, vjust = 0.5))

  • it is more flexible using ggplot.

2.7 Simple network analysis

plot_net(phylo_obj_2, distance = "(A+B-2*J)/(A+B)", type = "taxa",
         maxdist = 0.7, color="Class", point_label="Genus")

-End, hopefully you enjoyed this session!!